(b) All the claims are believed to be directed to a single invention. If the 
Office determines that all the claims presented are not obviously directed to a single 
invention, then Applicants will make an election without traverse as a prerequisite to the 
grant of special status. 

(c) Pre-examination searches were made of U.S. issued patents, including 
a classification search and a key word search. The classification search was conducted on or 
around August 16, 2004 covering Class 714 (subclasses 6, 7, 42, 52, 54, and 800) and Class 
711 (subclasses 113, 114, and 162), by a professional search firm, Lacasse & Associates, 
LLC. The key word search was performed on the USPTO full-text database including 
published U.S. patent applications. The inventors further provided three references 
considered most closely related to the subject matter of the present application (see references 
#7-9 below), which were cited in the Information Disclosure Statement filed with the 
application on February 9, 2004. 

(d) The following references, copies of which are attached herewith, are 
deemed most closely related to the subject matter encompassed by the claims: 

(1) U.S. Patent No. 6,070,249; 

(2) U.S. Patent No. 6,243,827 B1; 

(3) U.S. Patent No. 6,442,711 B1; 

(4) U.S. Patent No. 6,647,514 B1; 

(5) U.S. Patent Publication No. 2003/0056142 A1; 

(6) U.S. Patent Publication No. 2003/0188101 A1; 

(7) U.S. Patent No. 5,611,069; 

(8) Japanese Patent Publication No. JP 08-147112; and 

(9) David A. Patterson et al., "A Case for Redundant Arrays of 
Inexpensive Disks (RAID)," Computer Science Division, Dept. 
of Electrical Engineering and Computer Science, University of 
California, Berkeley, 1988. 

(e) Set forth below is a detailed discussion of references which points out 
with particularity how the claimed subject matter is distinguishable over the references. 

A. Claimed Embodiments of the Present Invention 

The claimed embodiments relate to a disk drive which is an external memory 
device for a computer and, more particularly, to a technique for preventing a plurality of disk 
drives in an array-type disk apparatus constituting a disk array from failing simultaneously, 
and to a technique for improving the host I/O response and the reliability at the time 
of data shifting among disk drives constituting a disk array group having redundancy. 

Independent claim 22 recites a storage system comprising a plurality of disks 
including first type disks configuring a RAID group and at least one second type disk, 
wherein each of the first type disks stores one of data received from a computer coupled to 
the storage system or parity data used for recovering the data received from the computer, and 
wherein the at least one second type disk is used as a spare disk for storing copy data of data 
stored in one of the first type disks; and a control section configured to hold an error status of 
each of the first type disks, start to mirror data between one of the first type disks and the at 
least one second type disk when the error status of the one of the first type disks matches a 
predetermined first criterion. After starting to mirror data between the one of the first type 
disks and the at least one second type disk, the control section is configured to stop mirroring 
data between the one of the first type disks and the at least one second type disk and start to 
mirror data between another one of the first type disks and the at least one second type disk, 
according to the error status of the one of the first type disks and the another one of the first 
type disks. 

In this claimed embodiment, the first type disk that forms a mirroring pair 
with the second type disk is switched according to the error status of each first 
type disk. Because the disk drive to be mirrored is switched dynamically, this operation is 
called the "dynamic mirroring operation." 
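
The switching behavior recited in claim 22 can be illustrated with a minimal sketch. It is 
not the claimed implementation: the class name, the use of an error count as the "error 
status," the numeric first criterion, and the rule that the spare follows the disk with the 
strictly highest error count are all assumptions made for illustration only. 

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ControlSection:
    """Hypothetical model of a control section; all names are illustrative."""
    first_criterion: int  # assumed: error count at which mirroring may start
    error_counts: Dict[str, int] = field(default_factory=dict)  # error status per first type disk
    mirrored_disk: Optional[str] = None  # first type disk currently paired with the spare

    def record_error(self, disk: str) -> None:
        """Update the error status of one first type disk, then re-evaluate the pairing."""
        self.error_counts[disk] = self.error_counts.get(disk, 0) + 1
        self._update_mirror()

    def _update_mirror(self) -> None:
        # Only disks whose error status meets the first criterion are candidates.
        candidates = [d for d, n in self.error_counts.items() if n >= self.first_criterion]
        if not candidates:
            return
        worst = max(candidates, key=lambda d: self.error_counts[d])
        current = self.mirrored_disk
        # Assumed policy: switch only when another disk is strictly worse than the current one.
        if current is None or self.error_counts[worst] > self.error_counts[current]:
            # Stop mirroring `current` (if any) and start mirroring `worst` with the spare.
            self.mirrored_disk = worst


ctl = ControlSection(first_criterion=2)
for disk in ["d0", "d0", "d1", "d1", "d1"]:
    ctl.record_error(disk)
print(ctl.mirrored_disk)  # d1
```

In this sketch the single spare first pairs with d0 and later moves to d1 once d1 shows the 
worse error status, so one spare disk covers whichever first type disk currently appears most 
likely to fail. 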

One of the benefits that may be derived is a highly reliable array-type disk 
apparatus which copies data to a spare disk drive in anticipation of a possible failure and 
reduces the probability of a double disk drive failure without a cost 
increase for spare disk drives. 

B. Discussion of the References 

1. U.S. Patent No. 6,070,249 

This reference discloses a split parity spare disk achieving method in a RAID 
subsystem. Discussed is a split parity spare disk achieving method for improving the defect 
endurance and performance of a RAID subsystem which distributively stores data in a disk 
array. The method may consist of constructing the disk array with at least two data disk drives 
for storing data, a spare disk drive used when a disk drive fails, and a parity disk drive for 
storing parity data; and splitting the parity data of the parity disk drive and storing the split 
data in the parity disk drive and the spare disk drive. See column 3, line 61 to column 4, 
line 4. 

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 

2. U.S. Patent No. 6,243,827 B1 

This reference relates to multiple-channel failure detection in RAID systems. 
Discussed is the use of software and a small portion of each disk in an array to write a bad 
area table on each disk. The embodiments may facilitate recovery of a RAID storage system 
from simultaneous failure of two or more disks. See column 4, lines 34-39. This reference 
does not appear to specifically utilize spare disks for failure protection, although it does 
address recovery from multiple disk failures. 

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 

3. U.S. Patent No. 6,442,711 B1 

This reference discloses a system and a method for avoiding storage failures in 
a storage array system. Discussed is a method for executing preventive maintenance of the 
conventional storage array system. A storage array system comprises a plurality of data 
storage devices for storing data, a spare storage device for replacing one of the plurality of 
data storage devices, and a control unit for controlling input and output operations. The 
control unit may include means for judging a necessity to execute preventive maintenance of 
each of the plurality of data storage devices by looking at the error rate. See column 2, lines 
6-9, 22-27, and 32-35. 

The disk array system includes data disks, a parity disk, and a spare disk. See Fig. 
2. In this disk array system, the necessity to execute maintenance is judged by the error rate 
of each disk. If maintenance is judged to be necessary for a disk, the data of that disk is 
copied to the spare disk. See Fig. 7. There is, however, no disclosure of switching the 
copy-source disk (i.e., the disk to be mirrored with the spare disk) from which data is copied 
to the spare disk by considering the error rate of the other disks. 

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 

4. U.S. Patent No. 6,647,514 B1 

This reference discloses improving host I/O performance and availability of a storage 
array during rebuild by prioritizing I/O requests. Discussed are rebuild I/O requests which 
may be given priority over host I/O requests when the storage array is close to permanently 
losing data (for example, when failure of one more particular disk in the storage array would 
result in data loss). Due to different RAID levels, failures of different disks can result in 
different RAID levels being rebuilt. Examples of rebuilding data in an array include migrating 
to other disks and/or RAID levels, or writing data to a spare disk. See column 3, lines 47-51; 
and column 6, lines 11-13 and 17-23. 
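
The prioritization summarized above can be pictured with a short sketch. It is an 
illustration only, not the method of the reference: the function names, the tuple 
representation of a request, and the test "failed disks have reached the array's fault 
tolerance" for being close to data loss are assumptions. 

```python
from typing import List, Tuple

Request = Tuple[str, str]  # (kind, label); kind is "host" or "rebuild" (assumed encoding)


def close_to_data_loss(failed_disks: int, fault_tolerance: int) -> bool:
    """Assumed test: one more disk failure would exceed what the array can survive."""
    return failed_disks >= fault_tolerance


def schedule(requests: List[Request], failed_disks: int, fault_tolerance: int) -> List[Request]:
    """Return the order in which requests would be serviced."""
    if not close_to_data_loss(failed_disks, fault_tolerance):
        return list(requests)  # keep arrival order
    # Stable sort: rebuild requests first, arrival order preserved within each kind.
    return sorted(requests, key=lambda r: r[0] != "rebuild")


queue = [("host", "h1"), ("rebuild", "r1"), ("host", "h2")]
print(schedule(queue, failed_disks=1, fault_tolerance=1))
print(schedule(queue, failed_disks=0, fault_tolerance=1))
```

With one disk already failed in an array that tolerates a single failure, the rebuild request 
is moved ahead of the host requests; with no failed disk the arrival order is kept. 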

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare 
disk is switched. 

5. U.S. Patent Publication No. 2003/0056142 A1 

This reference relates to a method and a system for leveraging spares in a data 
storage system including a plurality of disk drives. Disclosed is a method and system for 
leveraging spare disks for data redundancy in response to failure in a data storage system. 
The data storage system may be grouped into a plurality of arrays having data redundancy. 
The plurality of arrays may be arranged to maximize the number of arrays that are mirrored 
pairs of disk drives. In another embodiment, the plurality of arrays may be arranged in an 
optimum combination of arrays of mirrored pairs of disk drives. For every failure of one of 
said plurality of arrays due to a failed disk drive, a new array having data redundancy in a 
RAID configuration is created in the plurality of arrays. See paragraphs [0022]-[0025]. 

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 

6. U.S. Patent Publication No. 2003/0188101 A1 

This reference discloses partial mirroring during expansion, thereby eliminating 
the need to track the progress of stripes updated during expansion. Discussed is a method of 
mirroring data from a write request to a spare unit corresponding to a stripe unit in a spare 
disk being rebuilt during expansion. The disk array may already be configured to enter a 
compaction state upon failure of the replaced or spare disk during the expansion process, since 
the spare units contain a copy of the valid data stored in the corresponding stripe units in the 
replaced or spare disk. See paragraphs [0014] and [0023]. 

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 

7. U.S. Patent No. 5,611,069 

This reference discloses a disk array apparatus which predicts errors using 
mirror disks that can be accessed in parallel. A mirror disk unit 36-1, in which two disk units 
are provided as one set, is used as a component element of the disk array. The two disk units 
of the mirror disk unit 36-1 are allocated as the present use disk unit 32-1 and the spare 
disk unit 32-2. Data is written into both the present use disk unit 32-1 and the spare 
disk unit 32-2. Data is read out from the present use disk unit 32-1. See Fig. 1 and column 8, 
lines 5-24. The occurrence of a fault of the disk unit is judged and the allocation is switched 
from the present use disk unit to the spare disk unit. In an idle state, a simulation to check the 
disk array is executed and fault information is collected. See column 12, lines 5-53. The 
present use disk unit is not constructed as a mirror disk and when the fault is judged in the 
present use disk unit, the data is copied to the spare disk unit, thereby dynamically realizing a 
mirror disk construction. See Figure 18 and column 16, lines 18-21 and 40-47. 

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 

8. Japanese Patent Publication No. JP 08-147112 

This reference discloses a technique to efficiently perform error recovery 
work by automatically performing the recovery processing. If the frequency of error 
occurrence in one of disk devices 50-57 for data storage and a disk device 58 for redundant 
information storage in a disk array 5 exceeds a prescribed value, data of the disk device 
where the error occurs is restored into an auxiliary disk device 59 by a first data restoration 
part 46; and when the restoration operation of this part 46 is completed, a reinitializing part 47 
initializes (formats) the medium of the disk device where the error occurs. After initialization 
by the reinitializing part 47 is completed, a medium check part 48 checks the medium of the 
disk device where the error occurs. A second data restoration part 49 restores data of the 
auxiliary disk device 59 into the error disk device when it is determined by the medium 
check part 48 that the medium is normal. 
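
The four-stage recovery sequence described above can be outlined as follows. This is a 
hedged sketch of the described flow only: the function name, its parameters, and the returned 
step strings are hypothetical, and the part numbers in the comments merely map each step to 
the summary above. 

```python
from typing import List


def recover_disk(error_frequency: int, prescribed_value: int, medium_is_normal: bool) -> List[str]:
    """Return the recovery steps taken for one disk device (illustrative only)."""
    steps: List[str] = []
    if error_frequency <= prescribed_value:
        return steps  # error frequency does not exceed the prescribed value
    # First data restoration part 46: restore the data into the auxiliary disk device 59.
    steps.append("restore data of error disk into auxiliary disk 59")
    # Reinitializing part 47: initialize (format) the medium of the error disk.
    steps.append("initialize medium of error disk")
    # Medium check part 48: check the medium after initialization.
    steps.append("check medium of error disk")
    if medium_is_normal:
        # Second data restoration part 49: copy the data back into the error disk.
        steps.append("restore data from auxiliary disk 59 into error disk")
    return steps


for step in recover_disk(error_frequency=5, prescribed_value=3, medium_is_normal=True):
    print(step)
```

Note that the auxiliary disk in this flow always serves the one disk whose error frequency 
exceeded the prescribed value; nothing in the sequence reassigns it to a different disk. 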

As discussed in the present application at page 2, line 27 to page 3, line 27, the 
reference discloses a technique which copies data of a disk drive to its spare disk drive and 
restores the data in the spare disk drive in a case where the number of errors that have 
occurred in that disk drive exceeds a specified value. Further, the conventional array-type 
disk apparatus has an operational flow such that, when data read failures occur frequently in 
a disk drive from which data is shifted (hereinafter called "data-shifting disk drive") at the 
time of shifting data to the spare disk drive of the disk drive due to preventive maintenance 
or the like, a data read from the data-shifting disk drive is attempted and, after a data read 
failure is detected, the data in the data-shifting disk drive is restored from the disk drives 
that have redundancy using the data restoring function of the array-type disk apparatus. It is 
therefore expected that the prior art drive suffers a slower response to the data read request 
from the host computer. To avoid the response drop, it is typical to process the data read 
request from the host computer using only the system which isolates the data-shifting disk 
drive from the array-type disk apparatus when data read errors have occurred frequently in the 
data-shifting disk drive and restores the data in the data-shifting disk drive by means of the 
redundant disk drive by using the data restoring function of the array-type disk apparatus. 

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 

9. David A. Patterson et al., "A Case for Redundant Arrays of Inexpensive Disks 
(RAID)," Computer Science Division, Dept. of Electrical Engineering and 
Computer Science, University of California, Berkeley, 1988 

This reference discloses redundant arrays of inexpensive disks (RAID). As 
discussed in the present application at page 1, line 16 to page 2, line 5, the reference discloses 
an array-type disk apparatus known as a RAID (Redundant Arrays of Inexpensive Disks), which 
is a memory device that has a plurality of disk drives laid out in an array and a control 
section to control the disk drives. In the array-type disk apparatus, a read request (data read 
request) and a write request (data write request) are processed fast by the parallel operation of 
the disk drives, and redundancy is added to data. Array-type disk apparatuses are classified 
into five levels according to the type of redundant data to be added and the structure. 
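
As one concrete instance of redundancy added to data, parity-based RAID levels keep an 
exclusive-OR (XOR) parity block from which any single lost block can be rebuilt. The sketch 
below is illustrative only; the helper name and the sample byte values are assumptions and 
are not taken from the reference. 

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data disk, plus the parity block computed from them.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: rebuild it from the survivors and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

Parity of this kind protects against the loss of one disk in the group, which is why a 
second failure before the first is repaired is the case a spare disk is meant to guard against. 
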

The reference does not teach that after starting to mirror data between the one 
of the first type disks and the at least one second type disk, the control section is configured to 
stop mirroring data between the one of the first type disks and the at least one second type 
disk and start to mirror data between another one of the first type disks and the at least one 
second type disk, according to the error status of the one of the first type disks and the 
another one of the first type disks, as recited in independent claim 22. There is no disclosure 
of the "dynamic mirroring operation," in which a disk drive to be mirrored with the spare disk 
is switched. 
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[0 0 0 8] L^u «git;$-3T}i ^Sfcigc: 

S'JOSt^S^f'-fe^ h VxH — i^r ^lXLTfrW 

nuts:*,-?, r-i y9mwv>&m j $>. -i x<d 

rc®<DA^\Z±Z>ftM&&mtVT^tcfz.t6. T^i' 

tt. 

\z^-r^m^m^mtvxmMmx^mx^^^o 

[0 0 10] 

[RH*»ft-rsfc8e>©*g] 0 1 te#f£Bj©nii!&Bjf 

0-57 £5Tl:g1lS$gf2ttffl^X^gfi5 8 SflAfcT* 
-fXi'TH 5£KteU _hfi^ai*^©T^-trXtc 
*fLT«^OSS^^X^^B5 6~5 8£3£?U7^-te 
Xtif-<7i'7Kaiif2 4#U It, r^X 
^7W 5«4>£i:< l#C?lT^X^gl5 9* 

[0 0 1 1] Z.<D&5tZ : T'(7,f7\<"tmW.<DX.y— 0 

121:, ^15 ; -^«7ngB4 6, S^f-->^ 5^X364 
7, JSI#«aE«4 8RtfJB2 5 J — ^tt5£aS4 9ftRJt . 
■5. ^ 15 s - ^75584 6 tt, TV 7. ^ 7 K 5 (D?— 
ZW&m : r4 7,>?*m.h 0~5 7RZ/?LS1»^fB«ffl^ 

f^Xi'gf S 9 fC&TC-r-S. ?K— ->^7-fXSP4 7 

i7- ^xi'ftBojsw^'f— i'^^rf X (y* 
5-fX»4 7fc«fc£^x^7rfXa*3§7bfc&K:, x 
i4.9tt. «^*»4 8fcJ:0JS|*iE*»t«)g$nfc 



?mm\z'i§.7t?z>. 

[0012] II:, ?^^i'7H8Sgf 2(:±ti 

7G©HB&<fcif$7, ff-fx->^ 7^X^4 7 K.kS.Si'r: 
->*7-fX©M$&, '*K#tt3lEffi4 8 K.fc.5J*#IE**l£ 
< B-f - -> * 7 -f X©*$ T , !" » 2 x- 9 «tc& 4 

[0 0 13] H(c, ±fi«ffiH3 8H. »1x-**tc 
8R4'6fC«kSx-^«7C, BPfxi^^-rXSM 7fc<fc 
5B'f.x>'t7-f X, RtfSS 2 7 s — £/K5gSR 4 9 \Z&2> 
t^-^tg©^* fc-^HT, ±ffi&S 1 ^©§S7*8"&#> 
5©«iftBSIH]*lS»U -S^r B ^iATt>J:figIl 

®W\Z&ftiE1tZ>. 

. [0 0 14] ii;, T-YX^JKaagf 2i:n^> 
^#iSSB3 9. *15*-^*5e»4 -6fcJ;.65*- 

^^7C, S-f — J'V^'f XSB4 7 Kck-SS-i'-v'-r 
X. S.^m27^-^«7C^4 9 KiS^— 9m.7C<?>&X 

ornate®*, *mm&&mmz&&&wtz. 

[0 0 15] 

[f^ffl] z.<D&?u*5£mz&.z>?4 y,97u^mm.<D 
X7-@S81l;ititfv*:offf*s#^ti-5 t f-^xi' 
7 v<i ©-x— iS'StxjLfifB^ffl ©t^ ^ y. 9 mw<D\,*-tft 

*T-l7-^LTl7-f!4[ElS^S^Uiifc 

smmz^m?"* y.9mm^<Df—9'i&7cmiiF 

ZmteTZ. Z\<D t% ±.$l.mm\Zttl>X'r—9'&7t(OM 

#5S7T-5<h, ±&mm\z^<Dw&m j gv. ^rm^z 

[0 0 16] ±ttSl*fcttt^l/- ^tf* 

4 x9mw<®w^ —is* ; ?'{xt:'mi&-?&i m-isy 
x*mt!£, yn\z-(-z/\ ?-ixifimAjfzmto*%L 

»n«,_ tofifjiSiTJieSBfcS'f — ^+ 5-fX©357 

u ^7*e»c»r-*iis*«niBKa.tej; o"# 

[0 0 17] ±ffi8fifctt^U-^*5©»S*{* 
X^gS^&S-f -->t7-fXl:J;t)X7- ©[§l8b& 
^a7C^7<£jlftJ-r-5. Etll:J;t)«*©-fx->t7-f 

XTB«*riiifc7W : :x*»«©tti*£, T^-f^^ge^ 

[ooi8] £fc±&gimfrib : r4 y97u-(pm&m 
wwomztixb, a^>9mmmiz^'ov £ ^ X97U 



(4) 
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[0019] 

@2fr*^T. ^^©x^ Xi'TWgl 

fc7^Xy-;7W$!l«gM2 ilf/HXtLT}! 
&©-tVX;7^H5 0-5 ^ 9£MJiJJg^bfc^7.^7 

«l:*otli< 5 s — * £ Etrr 3 *?> © 8 # © IE16/B x 

7tfixVX;7SB5 8, M«l'&©fif-fXi'Sl 
5 9T«£E£tt-5. 

[0 0 2 0] T-fXi'7'Kgi2lt *7h3>tfa 

J777V^ 5 £&m2tlZT/UXfflW&4T-Mf&2 

XfM»gB3 1, MPU32, T—ffemfflm&S 3\ 7 
5^1/yW3 5, Af>^3 4, TJf^SfBISSB 3 6 # 

[0 0 2 1] MPU3 2«, i?n7u?y>A 3 71: 

j; o * x h n > tf 3. - 7 i ©x- ^ mmw& izM-r 
^atc#p#a©tt^^^*x hn>fi-^ 1 1 

^-r-5fc*CD±&^gB3 8 <t; ±tfcSB&SB3 8 T?$8 
^■T^3i7-|pI«cr)««^mS?F»^fB1tSC3 6 (CD 
^>^*ffitUTlB*«^-r^P^>^i!n ; S^3 9©^ 

tB*«si*-rv>*. jekmpus 2it«, ^^-^sot 

iz&mtiffim &*^u-7 is-pfflmme&QM 
pu3 2tzm^mtv-c^^. 

[0 0 2 2] 7*A-l7smfflm4 \Z\Z. y"-< XPTU-iffl 
S841, MPU42, y*-?m&fflW&4 3. 
^^yir%<y>i7 4 4&mft>ri2>o MPU4 2I1 V 
-f i7D^P^A4 5£*?tU lt-f>^7i-XS 
SS3©MPU3 2l:ili*x hn>b'cx— ^ 1j&>£© 
r-^^jt^*fC#'5x-Y Xi7TL^f 5 Cj(t1"5 U— H 
»jf£i;7i:f;i^--r MW, Mtc*^^©x^-[5ja©fc© 

[0 0 2 3] £©x^— igtt©fca?). ?<70/n^7 

A 4 5 (Cte, ^1^-37^75^4 6 , f?-fXiy^5KX 

sm 7 - m#&i^4 8&&zf> , m2 ; T-$>m.7iW>4 9©; 

^H^iStfe-tlTl^S. 7-^fi7.7*7>^4 4 
te. x-TX^Tl^f 5\zW>rtt^-{7 7giW5 0-5 9 
^<hfC7J^7>$'«^^feoT43 0. *Xhn>tf^-^ 
. 1 £ © ^— ^ \z # 5 x X £ 7 U -f © 7 -t 

X9#©U-FIWre#Sn7c^ffix-^C^UT. EC 
C \Z =k 0 fliE^F T5]W<£X 5 - L <t £ t |5*SS§£ 
i*J»f LT-. I7-5E: L7t^^ Xi7S®(CWJS-r-5 
X— ^^i^^7J r 7>^ 4 4 ©M£ 1 -D-i >7 V 



[0 0 2 4] fgl x-^«7cSB4 6 te, x— 
7J^>^4 4©m-^Cfi^gffibT*5f9, X^-^^IhI^:. 

5>—mwm<Dn^zm'&L. i7-7-f x^it©7 

-^&fff -fX7gf 5 9 C«7C^-B--5fc*©x— ^ 
«7C$!!lI£*fT-r£. 

[002 5] ^ir-fx^gis. gic^-r-sx-^aTc 

15- 7VX^gfi£l&<]E^&fB«fflxVX:7=g 
ft?CST^Xi'SI5 8©g-x-^£&fflL.T£j£-r 
»-f^yt7-f XSB4 7tt. Hlx- 
?'&7tm4 6 T^H!f5^ X;7§gB 5 9 C^f-r-SX^-y 
^Xi7gM©x-^a7C^jE^TC7c«^©7jsX hn 
>t!o.-^ 1 ^fett^l/-^M^Igg6^e»©fg^, 3b 

-5^«^-rn©jg^fc^vi*-&tt > i^o^^x-x 

fflfflm 3 (C^^fc^O^ 3 4 aBWifiW- A' 
XD-bfc^tr@»b, x^-^f x^B©&#©?? 
<fx->^^-fX\ fiP^SJM-fMa<*:L.T©7*-V^^r 

[0026] tMMftsaustt. s-rx^^xau 

. 7fC<t-SX^-^^X^S«©i*{*:©'fX->^7'rX^ 

*&7 ifc^Tig'i u -1- x -v 7 -r xi$m?utm&v> 

T±m<DV- F£fToT\ iE^CU-h*T^7c^Sd^© 
^^^ff-5o j»#^g?4 8(C«t€,^^IE«tc^ 
7tn'«. CinTB'fx->^7-fX©^Ti:^-i». R-f 
-yt7-fX,©^7ft. *Xhn>fcfa.— ^l^«fctJC^- 

[0 0 2 7 ] * 2 x-^^TE^ 49(1 ff -f X ^< 
X^T^lC^X hn>tf3.-^ 1 $.fz\$*'<U—?fflm 

CKW&A 9 >^ 3 4"l;J:'5ir«l^^-n7D- b 
fcBSUcei&U B-f-y+ X**2S^iE«»cl6fp nj 

-fX7gt5 9©T-^*«7ct5. d©*§^ ^«© 
X i7^M 5 9 \tjE%\zW}ftVT^2>Z\ i*^, ^« 
©■x^ 7.7 mm 5 9 ©x-^£x^-@m7)m^c5 ; i' 

x>^afc3tf--r-5j:tfcfes. 

[0 0 2 8] SfC ±tt-f >^7x-X«W3iOMP 
U 3 -2 0«IBi:bTKtJft:±ffi««»3 8 »±. tA'-<X 
^PSP4©MPU4 2t«J:5|gl5 i — ^«7cgP4 6, U 
^fr:yt7.-fX84 7, «^aEgB4 8;fe±DtSI2f- 
^«7tSl54 9 fc«fc*x9-B*ifi«©H*&t*T*J:tf 
-€-©i^***X hn>tfj.— ^ l fcfflft-r*. n 
-f-yV7<XC3WJ, ^©PifcliteS-f x->^5< 

«t7tt«KWft*ffl4 8fcJ:*iE1lf*T-CII'f x^^5-f' . 

[0 0 2 9] ±ffi«*«3 8«> T^7h^y^=L-7 1 



(5) 
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fcf. &*gfta^X^7WfWfflgB2IC/PV>TV>3 
[0 0 3 0] MfcJtffi«#.« 3 8 tt. *X h n > tj- 

-r-st. *xh3>tfi-^i^fc«^i/-iS'©ifflia5 
? — mm. <Dtc ft omm^ <D&n ^jg^-r & . 

[0 0 3 1] ±&#g^gB3 8 fc<k*7frX h3>tfj.-> 
77^X^3 5©«ffitC^bT 

±tt**»3 8 ttfffi&ASIC «k 3>ti' 

3 satoitijty h$nT^-ss^tc«. *x b^>x 

a. - ^ 1 6 © 7 ^ -fe X K»T 2. Jfc^X t— ^ X £ b T 
[0 0 3 2] mt,. T-fX?7KWgg2)55*xl- 

&#-K:.fc D *X h n > ¥=L—5r 1 ^©18^^*3*1-5 „ 
— 7J, *X hn>lf^-^ 1 i^X^rP-rfWffgfi 

[0 0 3 3] @3I1 @2©^X^TI/-fS!l^gB2 

±&-f >^7i- Xft!l#|lg&3©MPU3 2 #tf:X b 
l*^©^-^teMCJ;-5Aai?a^©W 

S 2 (Cii^., 'xA--fXS«ffllSg4©MPU4 2tC*fL'J- 

I"fttf9$l4 1 £?>L/T. ^X^TH" 5©f2ttffl^ 

?07^irXC«fc5X^y7 r S.2©U-K»>#*fett7'f 

[0 0 3 4] 09* WT, *X hn>tfo. — ^ 1 #"£.©7^ 

*«w*3 1 . x-^tesuwfi*3 T-pmrnrnm 

f-fX^gi5 0-5 7 K;ttT£x-^«>i<^fe«ktf7t 
[0 0 3 5] Sfc. *7h3>tfa-^l^f,£OU-K 



x>f Xi7gfi5 0~5 7 «kt)x— ^©^ttib^fr <rV x 
^W7KaWSf4 1, 5 i -^teai«l»«4-3, x- 
^tej^«ij«a5 3 3. 5^v^;i/-f >^7x-XfiJPgC3 1 

[0 0 3 6] J;l;Xf7 7*S 3t, T-f^?7W5CD 
fg£L/£x^X^;fc5;^§7^x-;/^T<5 < , feUTiE 

^wtt&x^— **£bfc^*£g«#»tttf. xx 

7^S4l:I*, MP U 4 2 A { r-^f i v 9 53*5 y$ 

4 4©7tfJST5#£>:?XU7©X5— 5g£lEJi&:£ 1 O 

[0 0 3 7] *l;7Ty7*S5T 1 T-?fi7JA') 
.>^4 4©«©«t'K:**jfc«>fc«jeii£«s.«x5— » ' 
^©^©^X^^M^abS^S^^iy^-r-S. *>L 

•>^s 6©x^-^3ifcattr. 0 4*3«fcy;@5«, ^3 

©Xxy:/S 6©#5£WfC.£3X^— $a3©f¥8BT* 
•So uOX^-iffllCo^T,' i2©f-fX?7H5 
fcBHrrViiBttUB^W X;?gB 5 0 ©x^-fg^iggS: 
^ggfil:lLTX7-f-f X;* £¥9^ £09 

[0 0 3 8] MPU4 2lC*5V>T. IHftffl^X^SB 

5 o©^-^^i-y^*>5>i5'4 4®«*taje«t:»-r 

x^— ^^X^t^JSL-T, MPU3 2l;gfI 
I«5»tiitt«ftSg»&MPU3 2 
MPU4 2C:ttU i40XTy7*Sll:Stj;p 

B5 9 iz&jfc2i£zft&(D?—&'&KmwcDmtt!3&mm 

[0 0 3 9] |S|^tMPU3 2«, ±&3S£g& 3 8 ©*8 
0 *X h =i > tf ^-^ 1 CM Lx-^ttTC^S^ 

Ot'*MPU3 2tt, 77^X^3 50ftMfi 

Kf. #J>i*»Cck0*:x Kn>tfo.— ^ l tc^-^^TcM 
8©JM&*$R#L», — 7j, 77^0l;Ut7 h£*lT 
-Vitltf, affifTt3tlT^S*:X hn>tfi-^ l^e© 
7 1> irX^T tc# 5 Xf-? Xlf ffi fc-^^Tx- ^«tc 

[0 0 4 0] MPU 3 2^e©T : -^«7cPI^©^* 
*tJfcMPU4 2tt,. mix-^«7cgP4 6 0*l6fcJ; 
D. T-{X^7 WW»»4 1 ^LTI7-?^^ 

*©^7C^S*gS^$-&-5„ d©^— ^a7U^S«. X 
X^^B 5 0 fcpfc^fclEfif&BBtttfBer-r X^S 
B5 1~ 5 7©^— ^t?uftx-CX^SB5 8©/1U 

[0 0 4 1] flT^X^gf 5 9 I;»f5l7-f^ 
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3T¥!IS']£*l3 X^-yXS 4 tit*-, MPU 
4 2UMPU3 2 fcx— ^«7C©^T*#Sff 3. wtl 
^j-TMPU3 2H fOit077^l'y^3 5 

[00 4 2] MPU32H *X h 3 > tfa.— ^ 1 
■f5f-^x7ig^5t, Xf7yS5f, *X 

*aH&&*iT*J»T««^7£f!IWU &©X^-y:/S 
6fc»tt« £©*X hn>tf^-^ lj&>S©8SMJj£^f# 

U T^IB«gP3 6 tfir^X^gf5 9 (^TS 
5n- ^ ttTC^T ©«H £SE®T S m a * >^S £fr 
■5. 

[0 0 4 3] Xr-^T'S 5T, JftX hn>bfi— ^ 1 

XS 6 Kit*-, MPU 3 2 \~Z.ti*jy$ 3 4 £5§S!)LTB# 
r41a«$M^-rs. *'»;5'3 4li, ^*j£«>;fc»rj£l* 

CiSjSMrT. A^>^3 4^-/\'7D-T5I» 

R8£tftK:_, *x hn>hf^-^ i ^<ww 
gB6±0?f-fn-> J f 5-f Xofi*«t*titf, *©Xt-v 
XS 7 ©i&giciitj. ifcH-fryt X©S»tf*& 
< £ fc, *7>^3 4 D-Lfc^tH p u 
4 2fcW-f— ^ir7-fX*»*-r*;:ifc&*. <• 

[0 0 4 4] MPU4 2I1 MPU3 2i:i5*Xha 

-l^tzM-i r->t7-fX, * 5. v> v> t #'0* 
3 4<D-t— /xxn-fcgxKH-Y -->t7-fx:o 

[0 0 4 5] l7-r^X^gf5 0-CH-f nS'+^'f 
XCDiE#«$T**X5"y 7*S 8 TfflgiJSn* fc, M P U 4 
2tt. ^^TX5- T-tXp&MS 0 tt*fU Xr-y^ 

mk^s-^-* fe»#a^«fc±iB*y - h it, 

[0 0 4 6] i^TXfy7*S10t, X7-r^X^ 
815 0 fc*tt*JK^*ffl3©jEW*7.jS*MP U 4 2 
T*(Si]*n*t, MPU 4 2&MPU3 2 fc3-f Xv'-v 

^'(xmw&zzT&mn-rz. ;iti£gt7TMPU3 2 

^S:XT77*S 1 1 <D£.o\Z?TO. 

[0 0 4 7] n-i-~>^y-ixmm^TmBizMi', 



^OXf77S 1.2Tr, *X hn>lf3--^ 1 i«l 

2 2T. ?l'fz:->ir7'fX*a«O^T*^»«l3**3. 
6ICrtJ95D^>^fflaiLTE1t«»*-&*. *Xh3 
>tfa.-^ Ut)«^5§(tTXT7^S 1 2T« 
#^7a**U8l£ft*£, Xfy/S 1 3T, MP U 3 2 
li*7>^3 4^'Jt7MTSgX^-M, *Xh 

n>tr^-^ i sfctt^u-^-ttwap 6 

T-StffKJt^S&tlti. B5©Xfy7*S-l 4 
fcJttt. Xf77'S2 3T : -«TO 

\z.fi*?>9 3 -rn«. SSCDX^-vX 
s i 4 Cilir. 

[0 0 4 8] aSCXf'i'T'SHIIifeoTIl *Xh 
3>tf3.— ^ 1 3ffctt:*"*U-*ttfpaB 6 
*Vi«Jl©Jg^>il^< i%>. lovy? 3 4CD^"— A*7n 

mzLvtz'r-fxirmws oizttTz^ffi?-* xprnms 

9*^®^— ^«7cfg^SMPU4 2\ZMVft1<K 7— 

[0 04 9] «V X £ gi 5 9 (D^-? <D 

y^jXZmms 0\Ztt-?Z>X.5'-®'lg.<D]E1ji;mT&X^ 
yXSl 5 I'M P U 4 2 tmWiTZ t , d ©x— ^ 0« 
^a©iE^7^MPU3 2{C®»-r-5. M*PU3 2 
«, ^©tf©77^1/yX^ 3 5 (DVtmiZf&CTisX h 
n>tf^L— ^ l i7-^e^L'fcf^Xi'8l5 0 
©*IB*l:H©^7«S£X'7 i v:7 r S 1 6©A5K:ff5. 

[0 0 5 0] ^ViTX7=->y 7 P S 17T. *X hxi>tf a. 

1#&©««JE:*&#-3T43B. ^©KI:XT7 7* 
S 2 5T, :=F#5£S21ftgB 3 X^-SguLtf^ 
Xi7gI5 0A«|g«LT'3r— ^*5c«»^7bfcc:fcS|B 

J: D««AS03E7ftttlC^««IB««^5ry XS 

i 7 -r^sifeti^i. — ifflX7-f4i:ff5 la^MS 

[.0 0 5 1]— Xr7 7"S 3T, ?«fYXfgt 
5 9 fc^ff 3X5— ^-fX^ggS 0 ©x— ^«5c**jE 
t»7Tt6*^fca^l:tt. f!f-<X?8i5 9 

Xf77"S 1 8 ©17- SUSKit 
tr. rco«^(c«. X7-f^X?gf5 0 tin^-TT- 
Ir^X?SB5 9 S^il, ^SXt^-^^TcJtta* 
ff-5 o 

[0 0 5 2] ifcXT77"S 8TS^Xyt 7-fX**E 
M7L&^fc0, Xf77"S 1 Otf^ail^ 
JES^b&A^ofcW'dfctt, Xf77"S2 1t?, 
-T-fX?815 OUS-fxi-vH XSffoTfettJi 
T^*:^**ec:LT^-5t)©i;^iJ»rL, x^-^^- 
X^gf5 0©3?^fcJ;-5,X7-^ia^fT5. StC, X 
f7 7"S 1 5iC*5^T, S-fXytMX'^tOX^ 

-*ecbfc^^xi7^S^©^li^^x^SM5 9^ 
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-^S2 4T> =r<( X?mW5 0 fcff-f "->t7-fXT 

^ ^ 5 o * ^ift-r * x 5 -ffia *ff 5 1: t fc & * . 

[0 0 5 3] M', ±EO*»«li. Saf^X^glS 
[0 0 5 4] MK±1B©II««K:*-3TB:, *X hn> 

tfi-^lKftb 
5*-**5c©3ST 

S«§LTV>4*t, />/i< ■h ! bS«]©®©7 ; -^^7cPS 
teffiSi^©©©«IB^l31©^7€r^-e€n««fc 

<\ 4©ra©ffi^tt^cjEi;Tskc^j?)"s c 
[oo5 5] mz*mw\z$>iT\t. ±&mm\zm&%: 
7D-Kj:«i«iHiE«T3fc©x5-iaa©«iaK:i»w. 

SWWSfi 2 «©tt**»jl*A: < & «©*S»-r*fc». 
fb3F*5BE*«3 6 CX7-IhI«©d^>^W 

[0 0 5 6] 

[f£BJ©5&5l] «JrKWbT«;fc±3fc#»9!fc:J:n - 
fcf, ITiE^F'5J^x'-^5 i x^^©^(C«fct3, x-<X 

^THof ©^^.x^saw^-^aTc^^^n-S) 
*fctt*^v-^*»&©fp*»s*je5;»"i:-r* c 



[0 0 5 7] Jfcl7- fg£5V Xi^Ct^Tte, gffil 
&K«<xi^5^%£S^x->*7'fX^7&©£® 
U-H»f^fcJ:*ii#:|ftaE38«fTton. JEM7T-I7- 

tcLT, 5c©3i«««K:i»ttK:R.S±5fc$:D. SM* 
©-f x->ir 7-f XTIsia-ra * 3 fc^-^x y 9 ©fg 
&\Zftl,®m&<X.7-®W.&m$:fT?Z\ttfTr22>. 

mi) *mw<Dmwm.wm 

[02] *56W©— *JS«**Ufc^Py^H 
[03] *fg"E©7^irxi!i3I©iaB§©:7n— 3^— h 
[04] 13CI7- ^S©f¥aB©7D— ^-v — h 
[0 5] 0 3©x^-&g©ffftffl©:7n-^-> 
*) 

[0 6] fttemmo-fuyzm 

1 :Jb&SB (*Xhn>bfi-^) 

2 : T^X^JKWgS 
3:±tt-f>!?7i-X«IWJ 

4 : f n^7.umn 

5 : T^^^rr W 

6 : 

3 1 : -f >?7x.-^mmm 

3 2, 4 2: MPU 

3 3, 4 3: ^-^CSSMWSS 
3 4 : *>>>^ 

3 5 : 7 7^1/yX^ " 
3 6 : *ff%fei*gB 
3 7, 45:V-f^D/D^7A 
3 8 : ±fi«e» . 

3 9 : n^>^Sgg 

4 1 : "TAT.ZTV'iffimU 
4 4 : "t— 9 =5- 3-V 2 

4 6 : SBl^-*«7cSB 
4 7 : ff-f x->*7'fXSB 
4 8 : JSMfr&5£SB 
4 9 : m,2^-i?®.7i.?& 



[Japanese Patent Publication No. 8-147112, drawing sheets (8) through (11): block diagrams and flowcharts with steps S1 through S25; labels illegible in this copy.]






A Case for Redundant Arrays of Inexpensive Disks (RAID) 



David A. Patterson, Garth Gibson, and Randy H. Katz 

Computer Science Division 
Department of Electrical Engineering and Computer Sciences 
571 Evans Hall 
University of California 
Berkeley, CA 94720



Abstract. Increasing performance of CPUs and memories will be
squandered if not matched by a similar performance increase in I/O. While
the capacity of Single Large Expensive Disks (SLED) has grown rapidly,
the performance improvement of SLED has been modest. Redundant
Arrays of Inexpensive Disks (RAID), based on the magnetic disk
technology developed for personal computers, offers an attractive
alternative to SLED, promising improvements of an order of magnitude in
performance, reliability, power consumption, and scalability. This paper
introduces five levels of RAIDs, giving their relative cost/performance, and
compares RAID to an IBM 3380 and a Fujitsu Super Eagle.

1. Background: Rising CPU and Memory Performance

The users of computers are currently enjoying unprecedented growth
in the speed of computers. Gordon Bell said that between 1974 and 1984,
single chip computers improved in performance by 40% per year, about
twice the rate of minicomputers [Bell 84]. In the following year Bill Joy
predicted an even faster growth [Joy 85]:

    $\text{MIPS} = 2^{\text{Year} - 1984}$

Mainframe and supercomputer manufacturers, having difficulty keeping
pace with the rapid growth predicted by "Joy's Law," cope by offering
multiprocessors as their top-of-the-line product.
But a fast CPU does not a fast system make. Gene Amdahl related
CPU speed to main memory size using this rule [Siewiorek 82]:

    Each CPU instruction per second requires one byte of main memory;

If computer system costs are not to be dominated by the cost of memory,
then Amdahl's constant suggests that memory chip capacity should grow
at the same rate. Gordon Moore predicted that growth rate over 20 years ago:

    $\text{transistors/chip} = 2^{\text{Year} - 1964}$

As predicted by Moore's Law, RAMs have quadrupled in capacity every
two [Moore 75] to three years [Myers 86].

Recently the ratio of megabytes of main memory to MIPS has been
defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In
part because of the rapid drop of memory prices, main memory sizes have
grown faster than CPU speeds and many machines are shipped today with
alphas of 3 or higher.

To maintain the balance of costs in computer systems, secondary
storage must match the advances in other parts of the system. A key measure
of magnetic disk technology is the growth in the maximum number of bits
that can be stored per square inch, or the bits per inch in a track
times the number of tracks per inch. Called M.A.D., for maximal areal
density, the "First Law in Disk Density" predicts [Frank87]:

    $\text{MAD} = 10^{(\text{Year} - 1971)/10}$

Magnetic disk technology has doubled capacity and halved price every three
years, in line with the growth rate of semiconductor memory, and in
practice between 1967 and 1979 the disk capacity of the average IBM data
processing system more than kept up with its main memory [Stevens81].

Capacity is not the only memory characteristic that must grow
rapidly to maintain system balance, since the speed with which
instructions and data are delivered to a CPU also determines its ultimate
performance.

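The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed
automatically to contain a substantial fraction of memory references;
(2) and the SRAM technology, used to build caches, whose speed has
improved at the rate of 40% to 100% per year.

In contrast to primary memory technologies, the performance of
single large expensive magnetic disks (SLED) has improved at a modest
rate. These mechanical devices are dominated by the seek and the rotation
delays: from 1971 to 1981, the raw seek time for a high-end IBM disk
improved by only a factor of two while the rotation time did not change
[Harker81]. Greater density means a higher transfer rate when the
information is found, and extra heads can reduce the average seek time, but
the raw seek time only improved at a rate of 7% per year. There is no
reason to expect a faster rate in the near future.

To maintain balance, computer systems have been using even larger
main memories or solid state disks to buffer some of the I/O activity.
This may be a fine solution for applications whose I/O activity has
locality of reference and for which volatility is not an issue, but
applications dominated by a high rate of random requests for small pieces
of data (such as transaction-processing) or by a low number of requests for
massive amounts of data (such as large simulations running on
supercomputers) are facing a serious performance limitation.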

2. The Pending I/O Crisis 

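What is the impact of improving the performance of some pieces of a
problem while leaving others the same? Amdahl's answer is now known
as Amdahl's Law [Amdahl67]:

    $S = \frac{1}{(1 - f) + f/k}$

where:
    S = the effective speedup;
    f = fraction of work in faster mode; and
    k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in
I/O. Then when computers are 10X faster (according to Bill Joy, in just
over three years), Amdahl's Law predicts effective speedup will be only
5X. When we have computers 100X faster, via evolution of uniprocessors
or by multiprocessors, this application will be less than 10X faster,
wasting 90% of the potential speedup.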
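
The arithmetic behind the 5X and "less than 10X" figures can be checked with a short sketch. It assumes an application that spends 10% of its time in I/O, so only the fraction f = 0.9 runs in the faster mode; the function name amdahl_speedup is ours, chosen for illustration.

```python
def amdahl_speedup(f: float, k: float) -> float:
    """Effective speedup S = 1 / ((1 - f) + f / k).

    f: fraction of the work that runs in the faster mode (0..1)
    k: speedup while in the faster mode (> 0)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 / ((1.0 - f) + f / k)


# Assumption: 10% of the time is I/O, so only f = 0.9 benefits from a faster CPU.
for k in (10, 100):
    print(f"CPU {k}X faster -> effective speedup {amdahl_speedup(0.9, k):.2f}X")
```

With a 10X faster CPU the result is a little over 5X, and with a 100X faster CPU it stays below 10X, because the unimproved I/O fraction dominates.
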

While we can imagine improvements in software file systems via
buffering for near term I/O demands, we need innovation to avoid an I/O
crisis [Boral 83].

3. A Solution: Arrays of Inexpensive Disks 

Rapid improvements in capacity of large disks have not been the only
target of disk designers, since personal computers have created a market for
inexpensive magnetic disks. These lower cost disks have lower performance
as well as less capacity. Table I below compares the top-of-the-line
IBM 3380 model AK4 mainframe disk, Fujitsu M2361A "Super Eagle"
minicomputer disk, and the Conner Peripherals CP 3100 personal
computer disk.



Characteristics                  IBM      Fujitsu  Conner   3380 v.  2361 v.
                                 3380     M2361A   CP3100   3100     3100
                                                            (>1 means
                                                            3100 is better)

Disk diameter (inches)           14       10.5     3.5      4        3
Formatted Data Capacity (MB)     7,500    600      100      .01      .2
Price/MB (controller incl.)      $18-$10  $20-$17  $10-$7   1-2.5    1.7-3
MTTF Rated (hours)               30,000   20,000   30,000   1        1.5
MTTF in practice (hours)         100,000  ?        ?        ?        ?
No. Actuators                    4        1        1        .2       1
Maximum I/O's/second/Actuator    50       40       30       .6       .8
Typical I/O's/second/Actuator    30       24       20       .7       .8
Maximum I/O's/second/box         200      40       30       .2       .8
Typical I/O's/second/box         120      24       20       .2       .8
Transfer Rate (MB/sec)           3        2.5      1        .3       .4
Power/box (W)                    6,600    640      10       660      64
Volume (cu. ft.)                 24       3.4      .03      800      110

Table I. Comparison of IBM 3380 disk model AK4 for mainframe
computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers,
and the Conner Peripherals CP 3100 disk for personal computers. By
"Maximum I/O's/second" we mean the maximum number of average seeks
and average rotates for a single sector access. Cost and reliability
information on the 3380 comes from widespread experience [IBM 87]
[Gawlick87] and the information on the Fujitsu from the manual [Fujitsu
87], while some numbers on the new CP3100 are based on speculation.
The price per megabyte is given as a range to allow for different prices for
volume discount and different mark-up practices of the vendors. (The 8
watt maximum power of the CP3100 was increased to 10 watts to allow
for the inefficiency of an external power supply, since the other drives
contain their own power supplies.)

One surprising fact is that the number of I/O's per second per actuator in an
inexpensive disk is within a factor of two of the large disks. In several of
the remaining metrics, including price per megabyte, the inexpensive disk
is superior or equal to the large disks.

The small size and low power are even more impressive since disks
such as the CP3100 contain full track buffers and most functions of the
traditional mainframe controller. Small disk manufacturers can provide
such functions in high volume disks because of the efforts of standards
committees in defining higher level peripheral interfaces, such as the ANSI
X3.131-1986 Small Computer System Interface (SCSI). Such standards
have encouraged companies like Adaptec to offer SCSI interfaces as single
chips, in turn allowing disk companies to embed mainframe controller
functions at low cost. Figure 1 compares the traditional mainframe disk
approach and the small computer disk approach. The same SCSI interface
chip embedded as a controller in every disk can also be used as the direct
memory access (DMA) device at the other end of the SCSI bus.

Such characteristics lead to our proposal for building I/O systems as
arrays of inexpensive disks, either interleaved for the large transfers of
supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many
small transfers of transaction processing. Using the information in Table
I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the
IBM 3380 and the same capacity, with lower power consumption and cost.

4. Caveats

We cannot explore all issues associated with such arrays in the space
available for this paper, so we concentrate on fundamental estimates of
price-performance and reliability. Our reasoning is that if there are no
advantages in price-performance or terrible disadvantages in reliability, then
there is no need to explore further. We characterize a transaction-processing
workload to evaluate performance of a collection of inexpensive disks, but
remember that such a collection is just one hardware component of a
complete transaction-processing system. While designing a complete TPS
based on these ideas is enticing, we will resist that temptation in this
paper. Cabling and packaging, certainly an issue in the cost and reliability
of an array of many inexpensive disks, is also beyond this paper's scope.



[Figure 1: diagram with two panels, "Mainframe" and "Small Computer".]

Figure 1. Comparison of organizations for typical mainframe and small
computer disk interfaces. Single chip SCSI interfaces such as the Adaptec
AIC-6250 allow the small computer to use a single chip to be the DMA
interface as well as provide an embedded controller for each disk [Adeptec
87]. (The price per megabyte in Table I includes everything in the shaded
boxes above.)

5. And Now The Bad News: Reliability 

The unreliability of disks forces computer systems managers to make
backup versions of information quite frequently in case of failure. What
would be the impact on reliability of having a hundredfold increase in
disks? Assuming a constant failure rate (that is, an exponentially
distributed time to failure) and that failures are independent (both
assumptions made by disk manufacturers when calculating the Mean Time
To Failure, MTTF), the reliability of an array of disks is:

    $\text{MTTF of a Disk Array} = \frac{\text{MTTF of a Single Disk}}{\text{Number of Disks in the Array}}$

Using the information in Table I, the MTTF of 100 CP 3100 disks is
30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000
hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider
scaling the array to 1000 disks, then the MTTF is 30 hours or about one
day, requiring an adjective worse than dismal.
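
As a quick check of these figures, the sketch below applies the array formula to the 30,000 hour rated MTTF from Table I. The function name array_mttf_hours is ours, and the calculation inherits the same assumptions of constant failure rate and independent failures.

```python
HOURS_PER_DAY = 24


def array_mttf_hours(disk_mttf_hours: float, num_disks: int) -> float:
    """MTTF of an array with no redundancy: single-disk MTTF / number of disks."""
    if num_disks < 1:
        raise ValueError("an array needs at least one disk")
    return disk_mttf_hours / num_disks


# 30,000 hours is the rated MTTF of the CP 3100 in Table I.
for n in (1, 100, 1000):
    hours = array_mttf_hours(30_000, n)
    print(f"{n:>5} disks: {hours:>8.0f} hours ({hours / HOURS_PER_DAY:.1f} days)")
```

One hundred disks give 300 hours (12.5 days) and one thousand disks give 30 hours.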

Without fault tolerance, large arrays of inexpensive disks are too
unreliable to be useful.

6. A Better Solution: RAID

To overcome the reliability challenge, we must make use of extra
disks containing redundant information to recover the original information
when a disk fails. Our acronym for these Redundant Arrays of Inexpensive
Disks is RAID. To simplify the explanation of our final proposal and to
avoid confusion with previous work, we give a taxonomy of five different
organizations of disk arrays, beginning with mirrored disks and progressing
through a variety of alternatives with differing performance and reliability.
We refer to each organization as a RAID level.

The reader should be forewarned that we describe all levels as if 
implemented in hardware solely to simplify the presentation, for RAID 
ideas are applicable to software implementations as well as hardware. 

Reliability. Our basic approach will be to break the arrays into
reliability groups, with each group having extra "check" disks containing
redundant information. When a disk fails we assume that within a short
time the failed disk will be replaced and the information will be
reconstructed on to the new disk using the redundant information. This
time is called the mean time to repair (MTTR). The MTTR can be reduced
if the system includes extra disks to act as "hot" standby spares; when a
disk fails, a replacement disk is switched in electronically. Periodically a
human operator replaces all failed disks. Here are other terms that we use:

    D = total number of disks with data (not including extra check disks);
    G = number of data disks in a group (not including extra check disks);
    C = number of check disks in a group;
    n_G = D/G = number of groups;

As mentioned above we make the same assumptions that disk
manufacturers make: that failures are exponential and independent. (An
earthquake or power surge is a situation where an array of disks might not
fail independently.) Since these reliability predictions will be very high,
we want to emphasize that the reliability is only of the disk-head
assemblies with this failure model, and not the whole software and
electronic system. In addition, in our view the pace of technology means
extremely high MTTF are "overkill" for, independent of expected lifetime,
users will replace obsolete disks. After all, how many people are still
using 20 year old disks?

The general MTTF calculation for single-error repairing RAID is
given in two steps. First, the group MTTF is:

    $MTTF_{Group} = \frac{MTTF_{Disk}}{G+C} \times \frac{1}{\text{Probability of another failure in a group before repairing the dead disk}}$

As more formally derived in the appendix, the probability of a second
failure before the first has been repaired is:

    $\text{Probability of Another Failure} = \frac{MTTR}{MTTF_{Disk}/(\text{No. Disks}-1)} = \frac{MTTR}{MTTF_{Disk}/(G+C-1)}$

The intuition behind the formal calculation in the appendix comes
from trying to calculate the average number of second disk failures during
the repair time for X single disk failures. Since we assume that disk failures
occur at a uniform rate, this average number of second failures during the
repair time for X first failures is

    $\frac{X \times MTTR}{\text{MTTF of remaining disks in the group}}$

The average number of second failures for a single disk is then

    $\frac{MTTR}{MTTF_{Disk}/\text{No. of remaining disks in the group}}$

The MTTF of the remaining disks is just the MTTF of a single disk
divided by the number of good disks in the group, giving the result above.

The second step is the reliability of the whole system, which is
approximately (since MTTF_Group is not quite distributed exponentially):

    $MTTF_{RAID} = \frac{MTTF_{Group}}{n_G}$

Plugging it all together, we get:

    $MTTF_{RAID} = \frac{MTTF_{Disk}}{G+C} \times \frac{MTTF_{Disk}}{(G+C-1) \times MTTR} \times \frac{1}{n_G} = \frac{(MTTF_{Disk})^2}{(G+C) \times n_G \times (G+C-1) \times MTTR}$

    $MTTF_{RAID} = \frac{(MTTF_{Disk})^2}{(D + C \times n_G) \times (G+C-1) \times MTTR}$

Since the formula is the same for each level, we make the abstract
numbers concrete, using these parameters as appropriate: D=100 total data
disks, G=10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1
hour, with the check disks per group C determined by the RAID level.

Reliability Overhead Cost. This is simply the extra check
disks, expressed as a percentage of the number of data disks D. As we shall
see below, the cost varies with RAID level from 100% down to 4%.

Useable Storage Capacity Percentage. Another way to
express this reliability overhead is in terms of the percentage of the total
capacity of data disks and check disks that can be used to store data.
Depending on the organization, this varies from a low of 50% to a high of
96%.

Performance. Since supercomputer applications and
transaction-processing systems have different access patterns and rates, we
need different metrics to evaluate both. For supercomputers we count the
number of reads and writes per second for large blocks of data, with large
defined as getting at least one sector from each data disk in a group. During
large transfers all the disks in a group act as a single unit, each reading or
writing a portion of the large data block in parallel.

A better measure for transaction-processing systems is the number of
individual reads or writes per second. Since transaction-processing
systems (e.g., debits/credits) use a read-modify-write sequence of disk
accesses, we include that metric as well. Ideally during small transfers each
disk in a group can act independently, either reading or writing independent
information. In summary supercomputer applications need a high data rate
while transaction-processing need a high I/O rate.

For both the large and small transfer calculations we assume the
minimum user request is a sector, that a sector is small relative to a track,
and that there is enough work to keep every device busy. Thus sector size
affects both disk storage efficiency and transfer size. Figure 2 shows the
ideal operation of large and small disk accesses in a RAID.
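
The reliability metrics of this section can be evaluated numerically. The sketch below is an illustration only, using the paper's stated parameters (D = 100 data disks, MTTF of a disk = 30,000 hours, MTTR = 1 hour); the function names are ours, and the formula coded is MTTF_RAID = MTTF_Disk^2 / ((D + C*n_G) * (G + C - 1) * MTTR) with n_G = D/G.

```python
def raid_mttf_hours(D: int, G: int, C: int,
                    mttf_disk: float = 30_000.0, mttr: float = 1.0) -> float:
    """MTTF of a single-error-repairing RAID, per the two-step calculation."""
    n_groups = D / G                      # n_G = D / G
    total_disks = D + C * n_groups        # data disks plus all check disks
    return mttf_disk ** 2 / (total_disks * (G + C - 1) * mttr)


def overhead_cost(G: int, C: int) -> float:
    """Check disks as a fraction of data disks."""
    return C / G


def usable_capacity(G: int, C: int) -> float:
    """Fraction of total disk capacity that holds user data."""
    return G / (G + C)


# Two configurations discussed in the text: mirroring, and G=10 with C=4.
for label, G, C in [("G=1, C=1 (mirrored)", 1, 1), ("G=10, C=4", 10, 4)]:
    print(f"{label}: MTTF {raid_mttf_hours(100, G, C):,.0f} hours, "
          f"overhead {overhead_cost(G, C):.0%}, "
          f"usable {usable_capacity(G, C):.0%}")
```

For the mirrored case this gives 4,500,000 hours with 100% overhead and 50% usable capacity; for G=10 and C=4 it gives roughly 494,500 hours with 40% overhead and 71% usable capacity.
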

[Figure 2, panel (a): Single Large or "Grouped" Read (1 read spread over G disks).]

Figure 2. Large transfer vs. small transfers in a group of G disks.

The six performance metrics are then the number of reads, writes, and
read-modify-writes per second for both large (grouped) or small (individual)
transfers. Rather than give absolute numbers for each metric, we calculate
efficiency: the number of events per second for a RAID relative to the
corresponding events per second for a single disk. (This is Boral's I/O
bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this
paper we are after fundamental differences so we use simple, deterministic
throughput measures for our performance metric rather than latency.

Effective Performance Per Disk. The cost of disks can be a
large portion of the cost of a database system, so the I/O performance per
disk, factoring in the overhead of the check disks, suggests the
cost/performance of a system. This is the bottom line for a RAID.

7. First Level RAID: Mirrored Disks

Mirrored disks are a traditional approach for improving reliability of
magnetic disks. This is the most expensive option we consider since all
disks are duplicated (G=1 and C=1), and every write to a data disk is also a
write to a check disk. Tandem doubles the number of controllers for fault
tolerance, allowing an optimized version of mirrored disks that lets reads
occur in parallel. Table II shows the metrics for a Level 1 RAID assuming
this optimization.



MTTF                             Exceeds Useful Product Lifetime
                                 (4,500,000 hrs or > 500 years)
Total Number of Disks            2D
Overhead Cost                    100%
Useable Storage Capacity         50%

Events/Sec vs. Single Disk       Full RAID    Efficiency Per Disk
Large (or Grouped) Reads         2D/S         1.00/S
Large (or Grouped) Writes        D/S          .50/S
Large (or Grouped) R-M-W         4D/3S        .67/S
Small (or Individual) Reads      2D           1.00
Small (or Individual) Writes     D            .50
Small (or Individual) R-M-W      4D/3         .67


Table II. Characteristics of Level 1 RAID. Here we assume that writes
are not slowed by waiting for the second write to complete because the
slowdown for writing 2 disks is minor compared to the slowdown S for
writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme
with extra disks that are invisible to the software, we assume an optimized
scheme with twice as many controllers allowing parallel reads to all disks,
giving full disk bandwidth for large reads and allowing the reads of
read-modify-writes to occur in parallel.

When individual accesses are distributed across multiple disks, average 
queueing, seek, and rotate delays may differ from the single disk case. 
Although bandwidth may be unchanged, it is distributed more evenly, 
reducing variance in queueing delay and, if the disk load is not too high, 
also reducing the expected queueing delay through parallelism [Livny 87].
When many arms seek to the same track then rotate to the described sector,
the average seek and rotate time will be larger than the average for a single
disk, tending toward the worst case times. This effect should not generally
more than double the average access time to a single sector while still
getting many sectors in parallel. In the special case of mirrored disks with
sufficient controllers, the choice between arms that can read any data sector
will reduce the time for the average read seek by up to 45% [Bitton 88].

To allow for these factors but to retain our fundamental emphasis we
apply a slowdown factor, S, when there are more than two disks in a
group. In general, S lies between 1 and 2 whenever groups of disks work
in parallel. With synchronous disks the spindles of all disks in the group are
synchronous so that the corresponding sectors of a group of disks pass
under the heads simultaneously [Kurzweil 88], so for synchronous disks
there is no slowdown and S = 1. Since a Level 1 RAID has only one data
disk in its group, we assume that the large transfer requires the same
number of disks acting in concert as found in groups of the higher level
RAIDs: 10 to 25 disks.

Duplicating all disks can mean doubling the cost of the database 
system or using only 50% of the disk storage capacity. Such largess 
inspires the next levels of RAID. 

8. Second Level RAID: Hamming Code for ECC 

The history of main memory organizations suggests a way to reduce
the cost of reliability. With the introduction of 4K and 16K DRAMs,
computer designers discovered that these new devices were subject to
losing information due to alpha particles. Since there were many single
bit DRAMs in a system and since they were usually accessed in groups of
16 to 64 chips at a time, system designers added redundant chips to correct
single errors and to detect double errors in each group. This increased the
number of memory chips by 12% to 38%, depending on the size of the
group, but it significantly improved reliability.

As long as all the data bits in a group are read or written together, 
there is no impact on performance. However, reads of less than the group 
size require reading the whole group to be sure the information is correct, 
and writes to a portion of the group mean three steps: 



1) a read step to get all the rest of the data;
2) a modify step to merge the new and old information;
3) a write step to write the full group, including check information.

Since we have scores of disks in a RAID and since some accesses are
to groups of disks, we can mimic the DRAM solution by bit-interleaving
the data across the disks of a group and then add enough check disks to
detect and correct a single error. A single parity disk can detect a single
error, but to correct an error we need enough check disks to identify the
disk with the error. For a group size of 10 data disks (G) we need 4 check
disks (C) in total, and if G = 25 then C = 5 [Hamming50]. To keep down
the cost of redundancy, we assume the group size will vary from 10 to 25.
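
The check disk counts quoted above are consistent with the usual single-error-correcting Hamming bound, 2^C >= G + C + 1. The paper does not state this bound explicitly, so the sketch below is an assumption that happens to reproduce its figures; the function name check_disks_needed is ours.

```python
def check_disks_needed(data_disks: int) -> int:
    """Smallest C with 2**C >= G + C + 1 (assumed single-error-correcting bound).

    C check bits can name G + C possible failed positions plus the
    "no error" case, so 2**C must be at least G + C + 1.
    """
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    c = 1
    while 2 ** c < data_disks + c + 1:
        c += 1
    return c


for g in (10, 25):
    print(f"G = {g:>2} data disks -> C = {check_disks_needed(g)} check disks")
```

For G = 10 this yields C = 4 and for G = 25 it yields C = 5, matching the text.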

Since our individual data transfer unit is just a sector, bit-interleaved
disks mean that a large transfer for this RAID must be at least G sectors.
Like DRAMs, reads to a smaller amount imply reading a full sector from
each of the bit-interleaved disks in a group, and writes of a single unit
involve the read-modify-write cycle to all the disks. Table III shows the
metrics of this Level 2 RAID.

MTTF                        Exceeds Useful Lifetime
                            G=10                     G=25
                            (494,500 hrs             (103,500 hrs
                            or >50 years)            or 12 years)
Total Number of Disks       1.40D                    1.20D
Overhead Cost               40%                      20%
Useable Storage Capacity    71%                      83%

Events/Sec        Full RAID   Efficiency Per Disk    Efficiency Per Disk
(vs. Single Disk)             L2       L2/L1         L2       L2/L1
Large Reads       D/S         .71/S    71%           .86/S    86%
Large Writes      D/S         .71/S    143%          .86/S    172%
Large R-M-W       D/S         .71/S    107%          .86/S    129%
Small Reads       D/SG        .07/S    6%            .03/S    3%
Small Writes      D/2SG       .04/S    6%            .02/S    3%
Small R-M-W       D/SG        .07/S    9%            .03/S    4%

Table III. Characteristics of a Level 2 RAID. The L2/L1 column gives
the % performance of level 2 in terms of level 1 (>100% means L2 is
faster). As long as the transfer unit is large enough to spread over all the
data disks of a group, the large I/Os get the full bandwidth of each disk,
divided by S to allow all disks in a group to complete. Level 1 large reads
are faster because data is duplicated and so the redundancy disks can also do
independent accesses. Small I/Os still require accessing all the disks in a
group, so only D/G small I/Os can happen at a time, again divided by S to
allow a group of disks to finish. Small Level 2 writes are like small
R-M-W because full sectors must be read before new data can be written
onto part of each sector.
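
The "Efficiency Per Disk" columns follow from dividing each "Full RAID" rate by the total number of disks, D(G+C)/G. The sketch below illustrates this for the G=10 column and for Level 1; it leaves the slowdown factor S symbolic by working only with the coefficient of D, and the function name per_disk_efficiency is ours.

```python
from fractions import Fraction


def per_disk_efficiency(full_raid_coeff: Fraction, G: int, C: int) -> Fraction:
    """Divide the coefficient of D in the 'Full RAID' column by disks per data disk.

    The total number of disks is D * (G + C) / G, so D cancels out.
    Any 1/S factor is left out here and carried along separately.
    """
    disks_per_data_disk = Fraction(G + C, G)
    return full_raid_coeff / disks_per_data_disk


G, C = 10, 4  # Level 2 with ten data disks per group
rates = {
    "Large Reads (D/S)": Fraction(1),
    "Small Reads (D/SG)": Fraction(1, G),
    "Small Writes (D/2SG)": Fraction(1, 2 * G),
}
for name, coeff in rates.items():
    print(f"{name}: {float(per_disk_efficiency(coeff, G, C)):.2f}/S")

# Level 1 (G=1, C=1): small R-M-W runs at 4D/3 over 2D disks.
print(f"Level 1 small R-M-W: {float(per_disk_efficiency(Fraction(4, 3), 1, 1)):.2f}")
```

The results round to .71/S, .07/S and .04/S for Level 2 with G=10, and to .67 for Level 1 small read-modify-writes, in line with Tables II and III.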

For large writes, the level 2 system has the same performance as level
1 even though it uses fewer check disks, and so on a per disk basis it
outperforms level 1. For small data transfers the performance is dismal
either for the whole system or per disk; all the disks of a group must be
accessed for a small transfer, limiting the maximum number of
simultaneous accesses to D/G. We also include the slowdown factor S
since the access must wait for all the disks to complete.

Thus level 2 RAID is desirable for supercomputers but inappropriate
for transaction processing systems, with increasing group size increasing
the disparity in performance per disk for the two applications. In
recognition of this fact, Thinking Machines Incorporated announced a
Level 2 RAID this year for its Connection Machine supercomputer called
the "Data Vault," with G = 32 and C = 8, including one hot standby spare
[Hillis87].

Before improving small data transfers, we concentrate once more on 
lowering the cost. 



9. Third Level RAID: Single Check Disk Per Group

Most check disks in the level 2 RAID are used to determine which
disk failed, for only one redundant parity disk is needed to detect an error.
These extra disks are truly "redundant" since most disk controllers can
already detect if a disk failed: either through special signals provided in the
disk interface or the extra checking information at the end of a sector used
to detect and correct soft errors. So information on the failed disk can be
reconstructed by calculating the parity of the remaining good disks and
then comparing bit-by-bit to the parity calculated for the original full
group. When these two parities agree, the failed bit was a 0; otherwise it
was a 1. If the check disk is the failure, just read all the data disks and store
the group parity in the replacement disk.
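
The reconstruction rule can be sketched with byte strings standing in for disk contents. This is an illustration under our own assumptions (equal-length blocks, hypothetical helper names parity and reconstruct), not a description of any particular controller: comparing the parity of the surviving disks with the stored group parity, bit by bit, is the same as XOR-ing them together.

```python
from functools import reduce
from operator import xor


def parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks (the group parity)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def reconstruct(surviving_data: list[bytes], check: bytes) -> bytes:
    """Rebuild the failed data disk from the surviving disks and the check disk.

    Where the parity of the survivors agrees with the stored parity the lost
    bit was 0, otherwise 1, which is exactly XOR of the two parities.
    """
    return parity(surviving_data + [check])


# Hypothetical contents of a group with G = 3 data disks and C = 1 check disk.
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
check = parity(data)

failed = 1  # pretend disk 1 has failed
survivors = [blk for i, blk in enumerate(data) if i != failed]
print(reconstruct(survivors, check) == data[failed])
```

If the check disk itself fails, calling parity on all the data disks regenerates it.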

Reducing the check disks to one per group (C=1) reduces the overhead
cost to between 4% and 10% for the group sizes considered here. The
performance for the third level RAID system is the same as the Level 2
RAID, but the effective performance per disk increases since it needs fewer
check disks. This reduction in total disks also increases reliability, but
since it is still larger than the useful lifetime of disks, this is a minor
point. One advantage of a level 2 system over level 3 is that the extra
check information associated with each sector to correct soft errors is not
needed, increasing the capacity per disk by perhaps 10%. Level 2 also
allows all soft errors to be corrected "on the fly" without having to reread a
sector. Table IV summarizes the third level RAID characteristics and
Figure 3 compares the sector layout and check disks for levels 2 and 3.

Table IV. Characteristics of a Level 3 RAID. The L3/L2 column gives
the % performance of L3 in terms of L2 and the L3/L1 column gives it in
terms of L1 (>100% means L3 is faster). The performance for the full
systems is the same in RAID levels 2 and 3, but since there are fewer
check disks the performance per disk improves.

Park and Balasubramanian proposed a third level RAID system
without suggesting a particular application [Park86]. Our calculations
suggest it is a much better match to supercomputer applications than to
transaction processing systems. This year two disk manufacturers have
announced level 3 RAIDs for such applications using synchronized 5.25
inch disks with G=4 and C=1: one from Maxtor and one from Micropolis
[Maginnis 87].

This third level has brought the reliability overhead cost to its lowest
level, so in the last two levels we improve performance of small accesses
without changing cost or reliability.

10. Fourth Level RAID: Independent Reads/Writes

Spreading a transfer across all disks within the group has the
following advantage:

* Large or grouped transfer time is reduced because transfer
  bandwidth of the entire array can be exploited.

But it has the following disadvantages as well:

* Reading/writing to a disk in a group requires reading/writing to
  all the disks in a group; levels 2 and 3 RAIDs can perform only
  one I/O at a time per group.

* If the disks are not synchronized, you do not see average seek
  and rotational delays; the observed delays should move towards
  the worst case, hence the S factor in the equations above.

This fourth level RAID improves performance of small transfers through
parallelism, the ability to do more than one I/O per group at a time. We
no longer spread the individual transfer information across several disks,
but keep each individual unit in a single disk.

The virtue of bit-interleaving is the easy calculation of the Hamming 
code needed to detect or correct errors in level 2. But recall that in the third 
level RAID we rely on the disk controller to detect errors within a single
disk sector. Hence, if we store an individual transfer unit in a single sector,
we can detect errors on an individual read without accessing any other disk.
Figure 3 shows the different ways the information is stored in a sector for
RAID levels 2, 3, and 4. By storing a whole transfer unit in a sector, reads
can be independent and operate at the maximum rate of a disk yet still
detect errors. Thus the primary change between level 3 and 4 is that we
interleave data between disks at the sector level rather than at the bit level.

Figure 3. Comparison of location of data and check information in
sectors for RAID levels 2, 3, and 4 for G=4. Not shown is the small
amount of check information per sector added by the disk controller to
detect and correct soft errors within a sector. Remember that we use
physical sector numbers and hardware control to explain these ideas, but
RAID can be implemented by software using logical sectors and disks.

At first thought you might expect that an individual write to a single
sector still involves all the disks in a group since (1) the check disk must
be rewritten with the new parity data, and (2) the rest of the data disks
must be read to be able to calculate the new parity data. Recall that each
parity bit is just a single exclusive OR of all the corresponding data bits in
a group. In level 4 RAID, unlike level 3, the parity calculation is much
simpler since, if we know the old data value and the old parity value as
well as the new data value, we can calculate the new parity information as
follows:

    new parity = (old data xor new data) xor old parity

In level 4 a small write then uses 2 disks to perform 4 accesses (2 reads
and 2 writes) while a small read involves only one read on one disk. Table
V summarizes the fourth level RAID characteristics. Note that all small
accesses improve, dramatically for the reads, but the small
read-modify-write is still so slow relative to a level 1 RAID that its
applicability to transaction processing is doubtful. Recently Salem and
Garcia-Molina proposed a Level 4 system [Salem 86].
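
The level 4 small-write rule, new parity = (old data xor new data) xor old parity, can be checked against a full recomputation of the group parity. The sketch below uses byte strings as stand-ins for sectors; the sector contents and helper names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise exclusive OR of two equal-length sectors."""
    if len(a) != len(b):
        raise ValueError("sectors must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """new parity = (old data xor new data) xor old parity.

    Only the target data disk and the check disk are touched:
    two reads (old data, old parity) and two writes (new data, new parity).
    """
    return xor_bytes(xor_bytes(old_data, new_data), old_parity)


# Hypothetical group of G = 4 data sectors plus one parity sector.
sectors = [b"\x10\x20", b"\x0f\x0f", b"\xaa\x55", b"\x01\x80"]
old_parity = b"\x00\x00"
for s in sectors:
    old_parity = xor_bytes(old_parity, s)

# Overwrite sector 2 and update parity without reading the other data disks.
new_sector = b"\xff\x00"
new_parity = updated_parity(sectors[2], new_sector, old_parity)
sectors[2] = new_sector

# Cross-check against recomputing parity from every data sector.
full = b"\x00\x00"
for s in sectors:
    full = xor_bytes(full, s)
print(new_parity == full)
```

The shortcut works because XOR-ing the old data into the old parity cancels its contribution, leaving room for the new data to be folded in.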

Before proceeding to the next level we need to explain the
performance of small writes in Table V (and hence small
read-modify-writes since they entail the same operations in this RAID).
The formula for the small writes divides D by 2 instead of 4 because 2
accesses can proceed in parallel: the old data and old parity can be read at
the same time and the new data and new parity can be written at the same
time. The performance of small writes is also divided by G because the
single check disk in a group must be read and written with every small
write in that group, thereby limiting the number of writes that can be
performed at a time to the number of groups.
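
As a rough illustration of this reasoning, the sketch below expresses
the small-write rate as D divided by 2 and by G. The function name and
the `single_disk_rate` parameter are assumptions made for the example;
the values D=100 and G=10 are used only as sample inputs.

```python
def level4_small_write_rate(num_data_disks, group_size, single_disk_rate=1.0):
    """Small writes per unit time for a level 4 RAID, relative to one disk.

    There are D/G groups, each limited to one small write at a time by its
    single check disk, and each write takes two access times (a parallel
    read phase followed by a parallel write phase), giving D / (2 * G).
    """
    return single_disk_rate * num_data_disks / (2 * group_size)


# Sample inputs: D=100 data disks in groups of G=10.
rate = level4_small_write_rate(100, 10)
print(rate)  # 5.0 times the rate of a single disk access
```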

The check disk is the bottleneck, and the final level RAID removes
this bottleneck.

Table V. Characteristics of a Level 4 RAID. The L4/L3 column gives
the % performance of L4 in terms of L3 and the L4/L1 column gives it in
terms of L1 (>100% means L4 is faster). Small reads improve because
they no longer tie up a whole group at a time. Small writes and R-M-Ws
improve some because we make the same assumptions as we made in
Table II: the slowdown for two related I/Os can be ignored because only
two disks are involved.

11. Fifth Level RAID: No Single Check Disk

While level 4 RAID achieved parallelism for reads, writes are still
limited to one per group since every write must read and write the check
disk. The final level RAID distributes the data and check information
across all the disks, including the check disks. Figure 4 compares the
location of check information in the sectors of disks for levels 4 and 5
RAIDs.

The performance impact of this small change is large since RAID 
level 5 can support multiple individual writes per group. For example, 
suppose in Figure 4 above we want to write sector 0 of disk 2 and sector 1 
of disk 3. As shown on the left of Figure 4, in RAID level 4 these writes
must be sequential since both sector 0 and sector 1 of disk 5 must be
written. However, as shown on the right, in RAID level 5 the writes can
proceed in parallel since a write to sector 0 of disk 2 still involves a write
to disk 5 but a write to sector 1 of disk 3 involves a write to disk 4.
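
This example can be sketched in a few lines of Python. The rotation rule
in `parity_disk` is an assumption chosen only to reproduce the two
placements stated in the text (check information for sector 0 on disk 5
and for sector 1 on disk 4); the function names and the 1-based disk
numbering are likewise illustrative.

```python
def parity_disk(sector, num_disks, level):
    """Disk (numbered 1..num_disks) holding the check information for a sector."""
    if level == 4:
        return num_disks  # level 4: one dedicated check disk
    if level == 5:
        # Assumed rotation: sector 0 -> last disk, sector 1 -> next to last, ...
        return num_disks - (sector % num_disks)
    raise ValueError("only levels 4 and 5 are modeled")


def disks_touched(disk, sector, num_disks, level):
    """A small write touches the data disk and the disk holding its parity."""
    return {disk, parity_disk(sector, num_disks, level)}


def can_run_in_parallel(write_a, write_b, num_disks, level):
    """Two small writes, each (disk, sector), can overlap if they share no disk."""
    a = disks_touched(*write_a, num_disks, level)
    b = disks_touched(*write_b, num_disks, level)
    return a.isdisjoint(b)


# The example from the text: sector 0 of disk 2 and sector 1 of disk 3,
# in a group of 5 disks (G=4 data disks plus check information).
writes = ((2, 0), (3, 1))
print(can_run_in_parallel(*writes, num_disks=5, level=4))  # False: both need disk 5
print(can_run_in_parallel(*writes, num_disks=5, level=5))  # True: disks {2, 5} and {3, 4}
```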

These changes bring RAID level 5 near the best of both worlds: small
read-modify-writes now perform close to the speed per disk of a level 1
RAID while keeping the large transfer performance per disk and high
useful storage capacity percentage of the RAID levels 3 and 4. Spreading
the data across all disks even improves the performance of small reads,
since there is one more disk per group that contains data. Table VI
summarizes the characteristics of this RAID.

Keeping in mind the caveats given earlier, a Level 5 RAID appears 
very attractive: if you want to do just supercomputer applications, or just 
transaction processing when storage capacity is limited, or if you want to 
do both supercomputer applications and transaction processing.

12. Discussion

Before concluding the paper, we wish to note a few more interesting
points about RAIDs. The first is that while the schemes for disk striping
and parity support were presented as if they were done by hardware, there is
no necessity to do so. We just give the method, and the decision between
hardware and software solutions is strictly one of cost and benefit. For
example, in cases where disk buffering is effective, there are no extra disk
reads for level 5 small writes since the old data and old parity would be in
main memory, so software would give the best performance as well as the
least cost.

In this paper we have assumed the transfer unit is a multiple of the
sector. As the size of the smallest transfer unit grows larger than one
sector per drive--such as a full track with an I/O protocol that supports
data returned out-of-order--then the performance of RAIDs improves
significantly because of the full track buffer in every disk. For example, if
every disk begins transferring to its buffer as soon as it reaches the next
sector, then S may reduce to less than 1 since there would be virtually no
rotational delay. With transfer units the size of a track, it is not even
clear if synchronizing the disks in a group improves RAID performance.

(b) Check information for Level 5 RAID for G=4 and C=1. The sectors are
shown below the disks, with the check information and data spread evenly
through all the disks. Writes to s0 of disk 2

Figure 4. Location of check information per sector for Level 4 RAID
vs. Level 5 RAID.

Table VI. Characteristics of a Level 5 RAID. The L5/L4 column gives
the % performance of L5 in terms of L4 and the L5/L1 column gives it in
terms of L1 (>100% means L5 is faster). Because reads can be spread over
all disks, including what were check disks in level 4, all small I/Os
improve by a factor of 1+C/G. Small writes and R-M-Ws improve because
they are no longer constrained by group size, getting the full disk
bandwidth for the 4 I/Os associated with these accesses. We again make
the same assumptions as we made in Tables II and V: the slowdown for
two related I/Os can be ignored because only two disks are involved.

This paper makes two separable points: the advantages of building
I/O systems from personal computer disks and the advantages of five
different disk array organizations, independent of disks used in those arrays.
The latter point starts with the traditional mirrored disks to achieve
acceptable reliability, with each succeeding level improving:

• the data rate, characterized by a small number of requests per second
for massive amounts of sequential information (supercomputer
applications);

• the I/O rate, characterized by a large number of read-modify-writes to
a small amount of random information (transaction processing);

• or the useable storage capacity,

or possibly all three.

Figure 5 shows the performance improvements per disk for each level
RAID. The highest performance per disk comes from either Level 1 or
Level 5. In transaction-processing situations using no more than 50% of
storage capacity, then the choice is mirrored disks (Level 1). However, if
the situation calls for using more than 50% of storage capacity, or for
supercomputer applications, or for combined supercomputer applications
and transaction processing, then Level 5 looks best. Both the strength and
weakness of Level 1 is that it duplicates data rather than calculating check
information, for the duplicated data improves read performance but lowers
capacity and write performance, while check data is useful only on a failure.

Inspired by the space-time product of paging studies [Denning 78], we
propose a single figure of merit called the space-speed product: the useable
storage fraction times the efficiency per event. Using this metric, Level 5
has an advantage over Level 1 of 1.7 for reads and 3.3 for writes for G=10.

Let us return to the first point, the advantages of building I/O systems
from personal computer disks. Compared to traditional Single Large
Expensive Disks (SLED), Redundant Arrays of Inexpensive Disks (RAID)
offer significant advantages for the same cost. Table VII compares a level 5
RAID using 100 inexpensive data disks with a group size of 10 to the
IBM 3380. As you can see, a level 5 RAID offers a factor of roughly 10
improvement in performance, reliability, and power consumption (and
hence air conditioning costs) and a factor of 3 reduction in size over this
SLED. Table VII also compares a level 5 RAID using 10 inexpensive data
disks with a group size of 10 to a Fujitsu M2361A "Super Eagle". In this
comparison RAID offers roughly a factor of 5 improvement in
performance, power consumption, and size with more than two orders of
magnitude improvement in (calculated) reliability.

RAID offers the further advantage of modular growth over SLED.
Rather than being limited to 7,500 MB per increase for $100,000 as in
the case of this model of IBM disk, RAIDs can grow at either the group
size (1000 MB for $11,000) or, if partial groups are allowed, at the disk
size (100 MB for $1,100). The flip side of the coin is that RAID also
makes sense in systems considerably smaller than a SLED. Small
incremental costs also make hot standby spares practical to further reduce
MTTR and thereby increase the MTTF of a large system. For example, a
1000 disk level 5 RAID with a group size of 10 and a few standby spares
could have a calculated MTTF of over 45 years.

A final comment concerns the prospect of designing a complete
transaction processing system from either a Level 1 or Level 5 RAID. The
drastically lower power per megabyte of inexpensive disks allows systems
designers to consider battery backup for the whole disk array: the power
needed for 110 PC disks is less than two Fujitsu Super Eagles. Another
approach would be to use a few such disks to save the contents of battery
backed-up main memory in the event of an extended power failure. The
smaller capacity of these disks also ties up less of the database during
reconstruction, leading to higher availability. (Note that Level 5 ties up
all the disks in a group in event of failure while Level 1 only needs the
single mirrored disk during reconstruction, giving Level 1 the edge in
availability).

13. Conclusion 

RAIDs offer a cost effective option to meet the challenge of
exponential growth in the processor and memory speeds. We believe the
size reduction of personal computer disks is a key to the success of disk
arrays, just as Gordon Bell argues that the size reduction of
microprocessors is a key to the success in multiprocessors [Bell 85]. In
both cases the smaller size simplifies the interconnection of the many
components as well as packaging and cabling. While large arrays of
mainframe processors (or SLEDs) are possible, it is certainly easier to
construct an array from the same number of microprocessors (or PC
drives). Just as Bell coined the term "multi" to distinguish a
multiprocessor made from microprocessors, we use the term "RAID" to
identify a disk array made from personal computer disks.

With advantages in cost-performance, reliability, power consumption, 
and modular growth, we expect RAIDs to replace SLEDs in future I/O 
systems. There are, however, several open issues that may bear on the
practicality of RAIDs:

• What is the impact of a RAID on latency?

• What is the impact on MTTF calculations of non-exponential failure
assumptions for individual disks?

• What will be the real lifetime of a RAID vs. calculated MTTF using the
independent failure model?

• How would synchronized disks affect level 4 and 5 RAID performance?

• How does "slowdown" S actually behave? [Livny 87]

• How do defective sectors affect RAID?

• How do you schedule I/O to level 5 RAIDs to maximize write
parallelism?

• Is there locality of reference of disk accesses in transaction processing? 

• Can information be automatically redistributed over 100 to 1000 disks
to reduce contention?

• Will disk controller design limit RAID performance? 

• How should 100 to 1000 disks be constructed and physically connected 
to the processor? 

• What is the impact of cabling on cost, performance, and reliability?

• Where should a RAID be connected to a CPU so as not to limit
performance? Memory bus? I/O bus? Cache?

• Can a file system allow different striping policies for different files?

• What is the role of solid state disks and WORMs in a RAID?

• What is the impact on RAID of "parallel access" disks (access to every
surface under the read/write head in parallel)?

Figure 5. Plot of Large (Grouped) and Small (Individual)
Read-Modify-Writes per second per disk and useable storage
capacity for all five levels of RAID (D=100, G=10). We
assume a single S factor uniformly for all levels, with S=1.3
where it is needed.



Table VII. Comparison of IBM 3380 disk model AK4 to Level 5 RAID using
100 Conners & Associates CP 3100s disks and a group size of 10 and a comparison
of the Fujitsu M2361A "Super Eagle" to a level 5 RAID using 10 inexpensive data
disks with a group size of 10. Numbers greater than 1 in the comparison columns
favor the RAID.

Appendix: Reliability Calculation

Using probability theory we can calculate the $\mathrm{MTTF}_{Group}$. We first
assume independent and exponential failure rates. Our model uses a biased
coin with the probability of heads being the probability that a second
failure will occur within the MTTR of a first failure. Since disk failures
are exponential:

$$P(\text{at least one of the remaining disks failing in MTTR})
= 1 - \left(e^{-\mathrm{MTTR}/\mathrm{MTTF}_{Disk}}\right)^{G+C-1}$$

In all practical cases

$$\mathrm{MTTR} \ll \frac{\mathrm{MTTF}_{Disk}}{G+C}$$

and since $(1 - e^{-X})$ is approximately $X$ for $0 < X \ll 1$:

$$P(\text{at least one of the remaining disks failing in MTTR})
\approx \frac{\mathrm{MTTR} \cdot (G+C-1)}{\mathrm{MTTF}_{Disk}}$$

Then on a disk failure we flip this coin:

heads => a system crash, because a second failure occurs before the
first was repaired;
tails => recover from error and continue.

Then

$$\begin{aligned}
\mathrm{MTTF}_{Group}
&= E[\text{time between failures}] \cdot E[\text{no. of flips until first heads}] \\
&= \frac{E[\text{time between failures}]}{P(\text{heads})} \\
&= \frac{\mathrm{MTTF}_{Disk}}{(G+C) \cdot \left(\mathrm{MTTR} \cdot (G+C-1) / \mathrm{MTTF}_{Disk}\right)} \\
&= \frac{(\mathrm{MTTF}_{Disk})^{2}}{(G+C) \cdot (G+C-1) \cdot \mathrm{MTTR}}
\end{aligned}$$

Group failure is not precisely exponential in our model, but we have
validated this simplifying assumption for practical cases of $\mathrm{MTTR} \ll
\mathrm{MTTF}_{Disk}/(G+C)$. This makes the MTTF of the whole system just
$\mathrm{MTTF}_{Group}$ divided by the number of groups, $n_G$.
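
The calculation above can be tried numerically with a short Python
sketch. The function names are illustrative, and the sample inputs (a
30,000 hour disk MTTF, a 1 hour MTTR, G=10, C=1, and 10 groups) are
assumptions made for the example rather than values stated in this
appendix.

```python
import math


def prob_second_failure(mttf_disk, mttr, g, c, exact=False):
    """Probability that one of the remaining G+C-1 disks fails within MTTR."""
    remaining = g + c - 1
    if exact:
        return 1.0 - math.exp(-mttr / mttf_disk) ** remaining
    # Approximation (1 - e^-X) ~ X, valid when MTTR << MTTF_Disk / (G+C).
    return mttr * remaining / mttf_disk


def mttf_group(mttf_disk, mttr, g, c):
    """MTTF_Group = MTTF_Disk^2 / ((G+C) * (G+C-1) * MTTR)."""
    return mttf_disk ** 2 / ((g + c) * (g + c - 1) * mttr)


def mttf_system(mttf_disk, mttr, g, c, num_groups):
    """MTTF of the whole system: MTTF_Group divided by the number of groups."""
    return mttf_group(mttf_disk, mttr, g, c) / num_groups


# Assumed sample inputs (hours); not taken from this appendix.
MTTF_DISK, MTTR, G, C, GROUPS = 30_000.0, 1.0, 10, 1, 10

approx = prob_second_failure(MTTF_DISK, MTTR, G, C)
exact = prob_second_failure(MTTF_DISK, MTTR, G, C, exact=True)
print(f"P(heads): approx={approx:.6e} exact={exact:.6e}")

hours = mttf_system(MTTF_DISK, MTTR, G, C, GROUPS)
print(f"system MTTF: {hours:,.0f} hours ({hours / 8760:.1f} years)")
```

With these inputs the approximate and exact coin probabilities differ
by well under one percent, which is the regime in which the simplified
formula is intended to apply.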
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